Toward a Cross-Linguistic Tagset

Author

  • Jan Cloeren
Abstract

With the spread of large quantities of corpus data, the need has arisen to develop some standard not only for the format of interchange of text (an issue which has already been taken up by the Text Encoding Initiative), but also for any information added in some subsequent stage of (linguistic) enrichment. The research community has much to gain by such standardization since it will enable researchers to effectively access and therefore make optimal use of the results of previous work on a corpus. This paper provides some direction of thought as to the development of a standardized tagset. We focus on a minimal tagset, i.e. a tagset containing information about wordclasses. We investigate what criteria should be met by such a tagset. On the basis of an investigation and comparison of ten different tagsets that have been used over the years for the (wordclass) tagging of corpora, we arrive at a proposal for a cross-linguistic minimal tagset for Germanic languages.


Similar resources

A Common Parts-of-Speech Tagset Framework for Indian Languages

We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languages (ILs) following the hierarchical and decomposable tagset schema. In spite of significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; th...


Tagset Design and Inflected Languages

An experiment designed to explore the relationship between tagging accuracy and the nature of the tagset is described, using corpora in English, French and Swedish. In particular, the question of internal versus external criteria for tagset design is considered, with the general conclusion that external (linguistic) criteria should be followed. Some problems associated with tagging unknown word...


Etiquetario morfosintáctico del SLI para corpus de lengua gallega: aplicación al corpus paralelo TECTRA

In this article we present a complete and normalized morphosyntactic tagset for the annotation of linguistic corpora in Galician. The elaboration of this tagset, designed by the Computational Linguistics Group (SLI) of the University of Vigo, following strictly the EAGLES recommendations (Leech and Wilson, 1996), includes the creation of an intermediate tagset that allows us to establish a corr...


A Support Tool for Tagset Mapping

Many different tagsets are used in existing corpora; these tagsets vary according to the objectives of specific projects (which may be as far apart as robust parsing vs. spelling correction). In many situations, however, one would like to have uniform access to the linguistic information encoded in corpus annotations without having to know the classification schemes in detail. This paper descri...


Part-of-speech Tagset and Corpus Development for Igbo, an African Language

This project aims to develop linguistic resources to support computational NLP research on the Igbo language. The starting point for this project is the development of a new part-of-speech tagging scheme based on the EAGLES tagset guidelines, adapted to incorporate additional language internal features. The tags are currently being used in a part-of-speech annotation task for the development of...



Publication date: 1993